A Part-of-Speech Tag Clustering for a Word Prediction System in Portuguese Language

Authors

  • Daniel Cruz Cavalieri
  • Teodiano Freire Bastos Filho
  • Mário Sarcinelli Filho
  • Sira E. Palazuelos-Cagigas
  • Javier Macías Guarasa
  • José Luis Martín Sánchez
Abstract

This paper presents an automatic method for reducing the part-of-speech tagset to be considered by a word prediction system for Portuguese. The method is based on a similarity measure applied to an association matrix, which is generated by applying an odds ratio association measure to the probability distribution of part-of-speech bigrams (bipos) in a corpus. The results reported in this paper show that using the proposed clustering method with an appropriate threshold value on the similarity has the potential to improve the word prediction system. Moreover, it makes it possible to use new clustering techniques such as fuzzy clustering. The results also show that, when using a word prediction system based on a syntactic model, clustering cannot be performed across the major syntactic categories, even if the clusters generated seem correct from a linguistic point of view.
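The pipeline described in the abstract (bipos counts, odds ratio association matrix, similarity, threshold) can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure or clustering procedure the authors use, so the cosine similarity, the 0.5 count smoothing, and the single-link merging below are assumptions made for illustration, and the function names `log_odds_matrix`, `cosine`, and `cluster_tags` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(tags, smoothing=0.5):
    """Return (tagset, matrix); matrix[a][b] is the smoothed log odds ratio
    of tag a being immediately followed by tag b in the tag sequence."""
    bigrams = Counter(zip(tags, tags[1:]))
    total = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), n in bigrams.items():
        first[a] += n   # bigrams whose first tag is a
        second[b] += n  # bigrams whose second tag is b
    tagset = sorted(set(tags))
    matrix = {}
    for a in tagset:
        matrix[a] = {}
        for b in tagset:
            # 2x2 contingency table for the bigram (a, b)
            n11 = bigrams[(a, b)]
            n12 = first[a] - n11
            n21 = second[b] - n11
            n22 = total - n11 - n12 - n21
            # Assumption: add 0.5 to every cell to avoid division by zero
            odds = ((n11 + smoothing) * (n22 + smoothing)) / (
                (n12 + smoothing) * (n21 + smoothing))
            matrix[a][b] = math.log(odds)
    return tagset, matrix


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster_tags(tags, threshold):
    """Merge tags whose association rows have cosine similarity >= threshold.
    Assumption: single-link merging, implemented with union-find."""
    tagset, matrix = log_odds_matrix(tags)
    rows = {t: [matrix[t][b] for b in tagset] for t in tagset}
    parent = {t: t for t in tagset}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for a, b in combinations(tagset, 2):
        if cosine(rows[a], rows[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in tagset:
        clusters.setdefault(find(t), []).append(t)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy tag sequence, invented for this example (not from the paper's corpus)
    toy = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "ADV", "DET", "NOUN"]
    print(cluster_tags(toy, threshold=0.8))
```

In this sketch a lower threshold merges more tags and yields a smaller tagset. The abstract's finding that clusters should not span major syntactic categories would correspond to restricting the candidate pairs in `cluster_tags` to tags within the same category.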


Similar resources

Design and Implementation of an Intelligent Part of Speech Generator

The aim of this paper is to report on an attempt to design and implement an intelligent system capable of generating the correct part of speech for a given sentence while the sentence is totally new to the system and not stored in any database available to the system. It follows the same steps a normal individual does to provide the correct parts of speech using a natural language processor. It...


Introducing a Machine-Based Approach Using the Lesk Algorithm and Part-of-Speech Tagging for Word Sense Disambiguation

The present study introduces a machine-based approach for word sense disambiguation (WSD). In Persian, a morphologically complex language in which many homographs arise, one way of performing WSD is to assign the right part-of-speech (POS) tags to words prior to WSD. Since the frequency of noun and adjective homographs in different Persian POS-tagged text corpora is high, POS tag disambi...


Morphosyntactic Disambiguation for TTS Systems

The purpose of this paper is to present the development of a morphosyntactic disambiguation system (or part-of-speech tagging system) which is intended to be used as a component of a Text-to-Speech (TTS) system for European Portuguese. In the development of the tagger, we compared two approaches: a probabilistic approach and a hybrid approach. Besides comparing these two approaches, this...


Word Context and Token Representations from Paradigmatic Relations and Their Application to Part-of-Speech Induction

Representation of words as dense real vectors in the Euclidean space provides an intuitive definition of relatedness in terms of the distance or the angle between one another. Regions occupied by these word representations reveal syntactic and semantic traits of the words. On top of that, word representations can be incorporated in other natural language processing algorithms as features. In th...


Part-of-Speech Tagging of Portuguese Using Hidden Markov Models with Character Language Model Emissions

This paper presents a probabilistic approach to POS tagging that combines HMMs and character language models, applied to Portuguese texts. In this approach, the emission probabilities for each hidden state in an HMM are estimated by a dedicated character language model. The resulting tagger has been trained and tested on Bosque, a subset of the Floresta Sintá(c)tica treebank, reaching 96.2% accuracy ...



Journal:
  • Procesamiento del Lenguaje Natural

Volume: 47

Publication date: 2011